A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text
Abstract
We address the problem of string matching on Ziv-Lempel compressed text. The goal is to search for a pattern in a text without uncompressing it. This is highly relevant for maintaining compressed text databases where efficient searching is still possible. We develop a general technique for string matching when the text comes as a sequence of blocks, which abstracts the essential features of Ziv-Lempel compression. We then apply the scheme to each particular type of compression. We present the first algorithm to find all the matches of a pattern in a text compressed using LZ77. When we apply our scheme to LZ78, we obtain a much more efficient search algorithm, which is faster than uncompressing the text and then searching it. Finally, we propose a new hybrid compression scheme that lies between LZ77 and LZ78: in practice it compresses as well as LZ77 and is as fast to search as LZ78.
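To make the "sequence of blocks" view concrete, the following is a minimal Python sketch of textbook LZ78 parsing. It is not the search algorithm described in the abstract; it only illustrates how an LZ78-compressed text is a list of blocks, each formed by an earlier block plus one extra character. The function names `lz78_parse` and `lz78_expand` are hypothetical and chosen for illustration.

```python
def lz78_parse(text):
    """Split text into LZ78 blocks, returned as (reference, char) pairs.

    Reference 0 denotes the empty block. Every other reference is the
    1-based number of an earlier block. (Illustrative sketch only.)
    """
    dictionary = {"": 0}  # phrase -> block number
    blocks = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            # Keep extending the longest phrase seen so far.
            current += ch
        else:
            # New block: a known phrase followed by one new character.
            blocks.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:
        # The text ended inside a known phrase; emit it as its prefix
        # plus its last character (the dictionary is prefix-closed).
        blocks.append((dictionary[current[:-1]], current[-1]))
    return blocks


def lz78_expand(blocks):
    """Rebuild the phrase of every block from the (reference, char) pairs."""
    phrases = [""]  # index 0 is the empty block
    for ref, ch in blocks:
        phrases.append(phrases[ref] + ch)
    return phrases[1:]


if __name__ == "__main__":
    blocks = lz78_parse("abababa")
    print(blocks)               # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a')]
    print(lz78_expand(blocks))  # ['a', 'b', 'ab', 'aba']
```

Note that an occurrence of a pattern such as `bab` in `abababa` spans several blocks (`b`, `ab`), which is why searching block by block requires tracking matches that start in one block and end in another.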
Similar resources
A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text
We address in this paper the problem of string matching on Lempel-Ziv compressed text. The goal is to search for a pattern in a text without uncompressing it. This is a highly relevant issue, since it is essential to have compressed text databases where efficient searching is still possible. We develop a general technique for string matching when the text comes as a sequence of blocks. This abstracts th...
A Unifying Framework for Compressed Pattern Matching
We introduce a general framework that captures the essence of compressed pattern matching for various dictionary-based compression methods. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as the Lempel-Ziv family (LZ77, LZSS, LZ78, LZW), byte-...
Byte pair encoding: a text compression scheme that accelerates pattern matching
Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is very fast and requires little work space. Moreover, it is easy to decompress an arbitrary part of the original text. However, it has not been very popular, since compression is rather slow and the compression ratio is not as good as that of other methods such as Lempel-Ziv type compression. In this paper, we bring ...
CHICO: A Compressed Hybrid Index for Repetitive Collections
Indexing text collections to support pattern matching queries is a fundamental problem in computer science. New challenges keep arising as databases grow, and for repetitive collections, compressed indexes become relevant. Different approaches have been proposed to exploit the regularities of repetitive collections. Some of these are Compressed Suffix Array, Lempel-Ziv, and Grammar...
Approximate String Matching over Ziv-Lempel Compressed Text
We present a solution to the problem of performing approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family, specifically the LZ78 and LZW variants. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions, in O(mkn + R) ti...